﻿【Do not distribute】
This is the README for submission No. 10813 to
the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022):
"MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching",
to which the Supplementary Material (including the source code) belongs.

The source code, model files, datasets, and VLKB knowledge dictionaries of this paper
are all in the "NeurIPS_2022_No_10813_MACK_Supplementary_Material" directory.


【environment】
## Python 3.6 is required; pickle is part of the standard library and needs no installation.
pip install numpy h5py


【terminal/shell command】
conda activate <your_env_name>
cd <your_path>/NeurIPS_2022_No_10813_MACK_Supplementary_Material/rerank/


【python evaluation command (CPU only)】
【CLIP】
## You can try various combinations of the hyper-parameters below to reproduce the experimental results of MACK.
## Note: cosine metric for [CLIP, VSRN, SAEM, ALBEF (coarse-grained)]; softmax metric for [UNITER, OSCAR].
【f30k t2i】
python nips22_MACK_clip_t2i_i2t_jj.py \
--vocab='./vocab_idx_word/vg_vocab.json' \
--p_feas='./p_feas_vlkb_word_idx_region_feat/p_feas.npy' \
--tags='./tags_NN/tags' \
--parses='./parses_JJ/parses' \
--img_feats='./bu_precomp_feats/f30k_test_buctxbox.h5' \
--sims='./base_sims/CLIP/f30k_RN50x16 test embedding_sim.npy' \
--metric='cosine' \
--test_mode='t2i' \
--test_type='1K' \
--top_k=15 \
--NN_scale=1.0 \
--JJ_scale=1.0 \
--t2i_scale=0.1 \
--i2t_scale=0.03 \
--T=1.0 \
--max_mean='max_mean' \
--output='./output/date_xx_xx_xx_output'

65.4, 87.2, 91.7, 244.3 ## base model (R@1/5/10/sum)【Table 2, line 1, Image Annotation】
24.36, 65.24, 86.14, 175.74 ## rerank (VLKB knowledge applied to the top_k results of the base model)
66.84, 88.26, 92.62, 247.72 ## base+rerank (performance improved by the proposed MACK VLKB knowledge)【Table 2, line 2, Image Annotation】


【f30k i2t】
python nips22_MACK_clip_t2i_i2t_jj.py \
--vocab='./vocab_idx_word/vg_vocab.json' \
--p_feas='./p_feas_vlkb_word_idx_region_feat/p_feas.npy' \
--tags='./tags_NN/tags' \
--parses='./parses_JJ/parses' \
--img_feats='./bu_precomp_feats/f30k_test_buctxbox.h5' \
--sims='./base_sims/CLIP/f30k_RN50x16 test embedding_sim.npy' \
--metric='cosine' \
--test_mode='i2t' \
--test_type='1K' \
--top_k=15 \
--NN_scale=1.0 \
--JJ_scale=1.0 \
--t2i_scale=0.1 \
--i2t_scale=0.03 \
--T=1.0 \
--max_mean='max_mean' \
--output='./output/date_xx_xx_xx_output'

85.4, 97.1, 98.7, 281.2 ## base model (R@1/5/10/sum)【Table 2, line 1, Image Retrieval】
39.8, 89.7, 98.4, 227.9 ## rerank (VLKB knowledge applied to the top_k results of the base model)
86.2, 97.2, 98.9, 282.3 ## base+rerank (performance improved by the proposed MACK VLKB knowledge)【Table 2, line 2, Image Retrieval】


【hyper-params】
[VLKB dictionary]
    --vocab ## VLKB dictionary (idx <-> word)【see dir "vocab_idx_word"】
    --p_feas ## VLKB dictionary (word idx -> prototype region feature)【see dir "p_feas_vlkb_word_idx_region_feat"】
[word/region@dataset]
    --tags ## per-word POS tag annotations ('NN' = noun) by StanfordPOSTagger【see dir "tags_NN"】
    --parses ## relation annotations between word pairs ('JJ' = adjective) by StanfordDependencyParser【see dir "parses_JJ"】
    --img_feats ## pre-computed bottom-up attention region features from SCAN [ECCV 18]/VSRN [ICCV 19]【see dir "bu_precomp_feats"】
[base model@dataset]
    --sims ## test similarity matrix (cosine similarity or probability score)【see dir "base_sims"】
    --metric ## cosine metric for [CLIP, VSRN, SAEM, ALBEF(coarse-grained)]; softmax metric for [UNITER, OSCAR]【"cosine"/"softmax"】
[test]
    --test_mode ## test mode (t2i/i2t/t2i+i2t)
    --test_type ## test type ['', '1K', '5-fold-1K', '5K']
    --top_k ## top k results by base model
[hyper-params]
    --NN_scale ## scale factor of NN when combining *NN* + *JJ*
    --JJ_scale ## scale factor of JJ when combining *NN* + *JJ*
    --t2i_scale ## scale factor of the reranked t2i sim matrix
    --i2t_scale ## scale factor of the reranked i2t sim matrix
    --T ## temperature of the softmax
    --max_mean ## pooling type
[else]
    --output ## (unused)
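As a rough sketch of how the scale factors above might combine the base and reranked similarity matrices: the fusion formula, function name, and additive weighting below are assumptions inferred from the flag names (`--t2i_scale`, `--i2t_scale`), not the actual logic of `nips22_MACK_clip_t2i_i2t_jj.py`.

```python
import numpy as np

def combine_sims(base_sim, rerank_t2i, rerank_i2t, t2i_scale=0.1, i2t_scale=0.03):
    """Hypothetical fusion: add the scaled reranked scores to the base similarities."""
    return base_sim + t2i_scale * rerank_t2i + i2t_scale * rerank_i2t

# Toy 2x2 similarity matrices (rows: queries, cols: candidates)
base = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
t2i = np.array([[1.0, 0.0],
                [0.0, 1.0]])
i2t = np.array([[1.0, 0.0],
                [0.0, 1.0]])
fused = combine_sims(base, t2i, i2t)  # fused[0, 0] = 0.9 + 0.1 + 0.03 = 1.03
```

With the default scales, the reranked scores act as a small correction on top of the base model's ranking rather than replacing it.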


【introduction】
vocab_idx_word & p_feas_vlkb_word_idx_region_feat: 
    The MACK Vision-Language Knowledge Base (MACK-VLKB) is built on the Visual Genome dataset [IJCV 17]. 
    The Visual Genome dataset has 3.8M (3,802,374) object-word annotations, covering 27,801 distinct words in total. 
    "vocab_idx_word" is a bi-directional table/dictionary that maps words to indices and indices to words. 
        word2idx dict: str      -> int
        idx2word dict: str(int) -> str
        Length: 27801
    "p_feas_vlkb_word_idx_region_feat" is a table/dictionary that maps a word index from "vocab_idx_word" to a prototype feature 
    extracted by the bottom-up attention region encoder (BUTD Faster R-CNN [CVPR 18]). 
        Shape: (27801, 2048)
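A minimal sketch of the word-to-prototype lookup these two dictionaries enable; the toy `word2idx`/`idx2word` entries are illustrative placeholders, not the actual VLKB contents.

```python
import numpy as np

# Toy stand-ins mirroring the described structures
word2idx = {"hat": 0, "orange": 1}          # word2idx dict: str -> int
idx2word = {"0": "hat", "1": "orange"}      # idx2word dict: str(int) -> str
p_feas = np.zeros((2, 2048), dtype=np.float32)  # real shape: (27801, 2048)

def prototype(word):
    """Map a word to its prototype region feature via the VLKB index."""
    return p_feas[word2idx[word]]

feat = prototype("hat")  # a 2048-d prototype region feature
```

In the released files, the vocab would be loaded from `vg_vocab.json` and the prototypes from `p_feas.npy` instead of being built inline.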
tags_NN & parses_JJ: 
    Tag/relation annotations of the texts from the Flickr30k (F30k)/MSCOCO (COCO) test splits. 
    "tags_NN" holds per-word POS tag annotations by StanfordPOSTagger ('NN' = noun). 
        e.g. in the description "orange hat", the word "hat" is an "NN" concept. 
    "parses_JJ" holds relation annotations between word pairs by StanfordDependencyParser ('JJ' = adjective). 
        e.g. in the description "orange hat", the word "orange" is a "JJ" concept modifying the "NN" concept "hat". 
    NN and JJ are the only two tag types considered. 
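To illustrate the "orange hat" example, here is a toy sketch of how NN tags and JJ relations could be filtered from such annotations; the tuple formats shown are assumed for illustration and need not match the on-disk layout of "tags_NN"/"parses_JJ".

```python
# Hypothetical annotation shapes for the caption "orange hat"
tags = [("orange", "JJ"), ("hat", "NN")]   # POS tag per word (StanfordPOSTagger style)
parses = [("amod", "hat", "orange")]       # adjectival modifier: JJ -> NN dependency

# Keep only the NN concepts
nouns = [word for word, tag in tags if tag == "NN"]

# Map each NN head to its JJ modifier via the "amod" relation
adj_of = {head: mod for rel, head, mod in parses if rel == "amod"}
```

For "orange hat" this yields the noun "hat" with the adjective "orange" attached to it, which is exactly the NN + JJ pairing the rerank scales operate on.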
bu_precomp_feats: 
    Bottom-up attention (Faster R-CNN, ResNet-101) pre-computed features (from BUTD [CVPR 18] and SCAN [ECCV 18]) of the Flickr30k (F30k)/MSCOCO (COCO) test splits. 
        F30k shape: (1000, 36, 2048) ## 1K test images, each with 36 bbox region/object features of 2048 dimensions. 
        COCO shape: (5000, 36, 2048) ## 5K test images, each with 36 bbox region/object features of 2048 dimensions. 
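A small h5py sketch of reading features with this layout; the toy file and the dataset key "feats" are assumptions (the real key inside `f30k_test_buctxbox.h5` may differ).

```python
import os
import tempfile

import h5py
import numpy as np

# Build a toy file with the same per-image layout as f30k_test_buctxbox.h5
# (real file: 1000 images x 36 regions x 2048-d features; here only 2 images)
path = os.path.join(tempfile.mkdtemp(), "toy_buctxbox.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("feats", data=np.zeros((2, 36, 2048), dtype=np.float32))

# Load it back; slicing with [:] materializes the dataset as a numpy array
with h5py.File(path, "r") as f:
    feats = f["feats"][:]  # (n_images, 36, 2048)
```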
base_sims: 
    Test similarity matrices produced by open-source image-text matching (ITM) base models. 
        F30k shape: (1000, 5000) ## 1K test only 
        COCO shape: (5000, 25000) ## 5K/5-fold-1K/1K test 
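The R@1/5/10 numbers reported above are computed from such a similarity matrix; here is a generic recall@k sketch (the function name and the single ground-truth index per query are simplifying assumptions — the real F30k/COCO evaluation handles 5 captions per image).

```python
import numpy as np

def recall_at_k(sims, gt, k):
    """Fraction of query rows whose ground-truth column index appears in the top-k."""
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k highest-scoring candidates
    return float(np.mean([gt[i] in topk[i] for i in range(len(gt))]))

# Toy 2x3 similarity matrix (rows: queries, cols: candidates)
sims = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.1]])
gt = [0, 1]                      # ground-truth candidate index per query
r1 = recall_at_k(sims, gt, 1)    # both queries rank their match first -> 1.0
```

The "sum" column in the results above is simply R@1 + R@5 + R@10.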

